Fisher's exact test explains a popular metric in information retrieval
Term frequency-inverse document frequency, or tf-idf for short, is a
numerical measure that is widely used in information retrieval to quantify the
importance of a term of interest in one out of many documents. While tf-idf was
originally proposed as a heuristic, much work has been devoted over the years
to placing it on a solid theoretical foundation. Following in this tradition,
we here advance the first justification for tf-idf that is grounded in
statistical hypothesis testing. More precisely, we first show that the
one-tailed version of Fisher's exact test, also known as the hypergeometric
test, corresponds well with a common tf-idf variant on selected real-data
information retrieval tasks. We then set forth a mathematical argument that
suggests the tf-idf variant approximates the negative logarithm of the
one-tailed Fisher's exact test P-value (i.e., a hypergeometric distribution
tail probability). The Fisher's exact test interpretation of this common tf-idf
variant furnishes the working statistician with a ready explanation of tf-idf's
long-established effectiveness.
Comment: 26 pages, 4 figures, 1 table; minor revision
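To make the claimed correspondence concrete, here is a minimal Python sketch comparing a common tf-idf variant (raw term count times log inverse document frequency) against the negative logarithm of the one-tailed Fisher's exact test P-value; the toy corpus and the choice of tf-idf variant are illustrative assumptions, not taken from the paper's experiments:

import math
from scipy.stats import hypergeom

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum computing uses qubits".split(),
]

def tf_idf(term, doc):
    tf = doc.count(term)                  # raw count of the term in the document
    df = sum(term in d for d in docs)     # number of documents containing the term
    return tf * math.log(len(docs) / df)  # one common tf-idf variant

def neg_log_fisher_p(term, doc):
    N = sum(len(d) for d in docs)         # total tokens in the corpus
    K = sum(d.count(term) for d in docs)  # total occurrences of the term
    n = len(doc)                          # tokens in the document of interest
    k = doc.count(term)                   # occurrences of the term in the document
    # One-tailed Fisher's exact test P-value: P(X >= k), X ~ Hypergeometric(N, K, n).
    return -math.log(hypergeom.sf(k - 1, N, K, n))

for term, doc in (("cat", docs[0]), ("qubits", docs[2])):
    print(term, round(tf_idf(term, doc), 3), round(neg_log_fisher_p(term, doc), 3))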
An Ontology-Based Recommender System with an Application to the Star Trek Television Franchise
Collaborative filtering based recommender systems have proven to be extremely
successful in settings where user preference data on items is abundant.
However, collaborative filtering algorithms are hindered by their
susceptibility to the item cold-start problem and a general lack of interpretability.
Ontology-based recommender systems exploit hierarchical organizations of users
and items to enhance browsing, recommendation, and profile construction. While
ontology-based approaches address the shortcomings of their collaborative
filtering counterparts, ontological organizations of items can be difficult to
obtain for items that mostly belong to the same category (e.g., television
series episodes). In this paper, we present an ontology-based recommender
system that integrates the knowledge represented in a large ontology of
literary themes to produce fiction content recommendations. The main novelty of
this work is an ontology-based method for computing similarities between items
and its integration with the classical Item-KNN (K-nearest neighbors)
algorithm. As a case study, we evaluated the proposed method against other
approaches by performing the classical rating prediction task on a collection
of Star Trek television series episodes in an item cold-start scenario. This
transverse evaluation provides insights into the utility of different
information resources and methods for the initial stages of recommender system
development. We found our proposed method to be a convenient alternative to
collaborative filtering approaches for collections of mostly similar items,
particularly when other content-based approaches are not applicable or
otherwise unavailable. Aside from the new methods, this paper contributes a
testbed for future research and an online framework to collaboratively extend
the ontology of literary themes to cover other narrative content.
Comment: 25 pages, 6 figures, 5 tables; minor revision
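As a rough illustration of the flavor of method described above, here is a minimal Item-KNN rating prediction sketch with a content-based item similarity; the paper derives similarities from a hierarchy of literary themes, whereas this stand-in uses plain Jaccard overlap of flat theme sets, and the episode tags and ratings below are invented:

item_themes = {
    "ep1": {"first contact", "utopia", "artificial intelligence"},
    "ep2": {"time travel", "utopia"},
    "ep3": {"first contact", "artificial intelligence"},
}

ratings = {"alice": {"ep1": 5.0, "ep2": 2.0}}  # user -> {item: rating}

def jaccard(a, b):
    # Flat theme-set overlap; a stand-in for the paper's hierarchy-aware similarity.
    return len(a & b) / len(a | b) if a | b else 0.0

def predict(user, item, k=20):
    # Classical Item-KNN: similarity-weighted average of the user's known ratings.
    neighbors = sorted(
        ((jaccard(item_themes[item], item_themes[j]), r)
         for j, r in ratings[user].items() if j != item),
        reverse=True,
    )[:k]
    den = sum(s for s, _ in neighbors)
    return sum(s * r for s, r in neighbors) / den if den else None

print(predict("alice", "ep3"))  # 5.0: ep3 shares themes with ep1, none with ep2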
Propagation connectivity of random hypergraphs
We study the concept of propagation connectivity on random 3-uniform hypergraphs. This concept is inspired by a simple linear time algorithm for solving instances of certain constraint satisfaction problems. We derive upper and lower bounds for the propagation connectivity threshold and point out some algorithmic implications.
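For concreteness, here is a minimal sketch of the marking process that propagation connectivity refers to, assuming the usual definition for 3-uniform hypergraphs: a hyperedge with two marked vertices marks its third vertex, and the hypergraph is propagation connected if some starting pair of vertices eventually marks everything. The brute-force search over starting pairs is for illustration only and is not the linear-time algorithm alluded to above:

from itertools import combinations

def propagates(n, edges, start):
    # Marking rule: a hyperedge with exactly two marked vertices marks the third.
    marked = set(start)
    changed = True
    while changed:
        changed = False
        for e in edges:
            if sum(v in marked for v in e) == 2:
                marked.update(e)
                changed = True
    return len(marked) == n

def propagation_connected(n, edges):
    # Brute force over all starting pairs (illustrative only, not linear time).
    return any(propagates(n, edges, pair) for pair in combinations(range(n), 2))

edges = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
print(propagation_connected(5, edges))  # True: starting from {0, 1}, marks spread to all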
A Simple Message Passing Algorithm for Graph Partitioning
Motivated by belief propagation, we propose a simple and deterministic
message passing algorithm for the Graph Bisection problem and related
problems. The running time of the main algorithm is linear in the number
of vertices and edges. To evaluate its average-case correctness, we use
planted solution models. For the Graph Bisection problem under the
standard planted solution model with probability parameters p and r, we
prove that our algorithm yields a planted solution with probability
> 1 − δ if p − r = Ω(n^{−1/2} log(n/δ)).
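The paper's exact message passing update is not reproduced in this abstract, so the following sketch substitutes a related linear-algebraic heuristic: power iteration on the mean-centered adjacency matrix of a graph drawn from the planted solution model, which also recovers the planted bisection when p − r is large enough. All parameters are illustrative:

import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 0.5, 0.1
truth = np.array([1] * (n // 2) + [-1] * (n // 2))  # the planted bisection

# Sample the planted solution model: same-side pairs get an edge with
# probability p, opposite-side pairs with probability r.
prob = np.where(np.equal.outer(truth, truth), p, r)
A = np.triu((rng.random((n, n)) < prob).astype(float), 1)
A = A + A.T

B = A - A.mean()                   # center to suppress the all-ones direction
x = rng.standard_normal(n)
for _ in range(50):                # power iteration toward the dominant eigenvector
    x = B @ x
    x /= np.linalg.norm(x)

labels = np.sign(x)
accuracy = max(np.mean(labels == truth), np.mean(labels == -truth))
print(f"recovered {accuracy:.0%} of the planted bisection")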
Finding Most Likely Solutions
As a framework for simple but basic statistical inference problems, we
introduce the generic Most Likely Solution problem: the task of finding
a most likely solution (MLS for short) for a given problem instance
under some given probability model. Although many MLS problems are
NP-hard, we propose, for these problems, to study their average-case
complexity under their assumed probability models. We show three
examples of MLS problems, and explain that "message passing algorithms"
(e.g., belief propagation) work reasonably well for these problems. Some
of the technical results of this paper are from the author's recent
work [WY06, OW06].
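As a toy instance of an MLS problem, the following sketch brute-forces the most likely labeling of a small graph under the planted bisection model from the previous abstract; the graph and parameters are invented, and exhaustive search stands in for the message passing algorithms the paper actually advocates:

from itertools import product
import math

p, r = 0.8, 0.2            # intra-side and inter-side edge probabilities
n = 6
edges = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)}

def log_likelihood(labels):
    # Probability of observing exactly this edge set, given the labeling.
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            q = p if labels[i] == labels[j] else r
            ll += math.log(q if (i, j) in edges else 1 - q)
    return ll

# The MLS is the maximum-likelihood labeling; exhaustive search is
# exponential in n, which is why average-case message passing is interesting.
best = max(product((-1, 1), repeat=n), key=log_likelihood)
print(best)  # separates {0, 1, 2} from {3, 4, 5}, matching the edge pattern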